Doc2Vec getting back the same vector from infer_vector #374
Comments
Even a vector from the bulk training is the product of a random process, which on subsequent runs can be quite different based on slight differences in setup. A different seed, or ordering of the training examples, or even just random scheduling jitter of a multi-thread process will all result in different end model/vector states. Each end-state will be about equally 'good' at the training goal – corpus word prediction – and thus should be roughly equally good at other outside tasks. But equally 'good' end states might have the vectors in very different coordinates.

The inference tries to fit a later example into a frozen model, and so if you re-present the same document, it should wind up 'close' to the vector that the same document induced in multi-pass bulk training. But how 'close' would depend on a lot of things. The information in the PV paper about parameter choices is limited. Similar alpha/iteration choices as the original training seem a reasonable choice, but on the IMDB dataset, I've seen inferred vectors for doc X become closer to the bulk-trained vector for X than any other bulk-trained vectors with just a few steps at a larger alpha. (Perhaps a coarse approach is OK with the far-fewer free parameters?) If inference isn't coming close on your document set, maybe something else is wrong.

Note that I also have the vague hunch that if the model is very large compared to the dataset size – e.g. thousands of dimensions, or the very large input/hidden layers of DM/concat mode – maybe there are many coordinate regions equally good at the train/inference prediction task, so any one inference doesn't necessarily match the earlier result. But I'm not sure of this.

You can see examples of inference more-or-less working to match bulk-trained vectors in the demo IPython notebook (in /docs/notebooks) or the /gensim/tests/test_doc2vec.py unit tests.
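For reference, here is a rough sketch of that kind of sanity check, in the spirit of the checks in /gensim/tests/test_doc2vec.py. It assumes model is an already-trained gensim Doc2Vec and train_corpus is the list of TaggedDocument objects it was trained on, and it uses the older parameter/attribute names (steps, docvecs) rather than the newer epochs/dv.

```python
# Sanity check: infer a vector for a document the model was trained on and see
# whether its own bulk-trained vector ranks as the nearest neighbor.
# Assumes `model` (trained Doc2Vec) and `train_corpus` (list of TaggedDocument) exist.
doc = train_corpus[0]
inferred = model.infer_vector(doc.words, alpha=0.1, steps=20)

# Rank all bulk-trained doc-vectors by similarity to the inferred vector; if
# inference is working reasonably, doc.tags[0] should appear at or near the top.
sims = model.docvecs.most_similar([inferred], topn=5)
print(sims)  # the exact similarity values will vary from run to run
```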
Hi @gojomo, thanks for the reply :) So, first off - I did test on words, not the TaggedDocument class. And I am creating a 300-element vector from 80,000 documents. Is there any rule of thumb to make sure I do not create too many dimensions? So, as far as I see it - either
Is there any logic behind what sort of datasets will work well with doc2vec? Or is the only way of knowing this by running it? I tried doing the following: I took one sentence from my corpus and trained doc2vec on it about 50,000 times. When I gave the same sentence to infer_vector, it still gave me a completely different vector. Is this a good test to check that the vector should be nearly the same, or could the randomness change it drastically?
Nothing seems out-of-the-ordinary with your document count or chosen dimensionality. Only you would know if your text/vocabulary is so unique it could confuse this technique. Most of this is trial and error, starting with data preparation or parameters similar to published results, then trying and evaluating variants.

To be clear, your bulk training should be using TaggedDocument instances, with 'words' a list of strings (the content) and 'tags' a list of non-word identifiers for the document (usually just a list with one item, a unique int ID). When that's done – after at least a few full passes, but often 10-20 – you can present lists-of-words to the infer_vector() method, as in the sketch below.

How are you prepping your 80K sentences? What training mode and setup parameters are you using? When you finish initial training, pick one of the 80K sentences, and request something like the most-similar documents for its inferred vector, does its own bulk-trained vector show up near the top?

You'd not want to train excessively on a single sentence, as the overall point is to learn a general model that can represent a range of texts. The sanity checks around inferred vectors in test_doc2vec.py or the IPython notebook are better models to follow.
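As a concrete illustration of the data preparation and inference flow described above (a minimal sketch, not the poster's actual pipeline; the corpus, parameter values, and names are placeholders, and vector_size/epochs are the newer gensim spellings of size/iter):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# 'words' is a list of string tokens; 'tags' is a list with one unique ID per document.
raw_docs = ["the first example sentence", "another short training sentence"]
train_corpus = [TaggedDocument(words=text.split(), tags=[i])
                for i, text in enumerate(raw_docs)]

# Illustrative parameters only; a real corpus needs sensible min_count, size, passes, etc.
model = Doc2Vec(train_corpus, vector_size=100, min_count=1, epochs=20, workers=1)

# After training, inference takes a plain list of word tokens, not a TaggedDocument.
vec = model.infer_vector("the first example sentence".split())
```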
Hi, subsequent calls to infer_vector with the same doc return different vectors:

features1 = infer_vector(doc)
features2 = infer_vector(doc)
assert features1 == features2  # this will fail!

How can I set my model to return the same results? A known seed and a single thread?
Some steps that might achieve that result are discussed in issue #447. |
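For reference, a rough sketch of two approaches that are often suggested for this, not taken from issue #447 itself. The second one re-seeds model.random before each call, which relies on infer_vector drawing its random initialization from that attribute; that is a version-dependent implementation detail, so treat it as an assumption to verify against your gensim version.

```python
import numpy as np
# Assumes `model` is a trained gensim Doc2Vec (ideally built with workers=1 and a fixed
# seed) and `words` is a tokenized document, as in the earlier sketches.

# Option 1: accept small run-to-run jitter and compare by cosine similarity, not equality.
v1 = model.infer_vector(words, steps=50)
v2 = model.infer_vector(words, steps=50)
cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print("cosine similarity:", cos)  # should be high, but rarely exactly 1.0

# Option 2 (assumption, version-dependent): reset the model's RNG so both inference
# runs start from the same random state; only meaningful with a single worker thread.
model.random = np.random.RandomState(0)
v1 = model.infer_vector(words, steps=50)
model.random = np.random.RandomState(0)
v2 = model.infer_vector(words, steps=50)
print("identical:", np.allclose(v1, v2))
```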
dup #447 |
I have a problem with inferred_vector.
Hi,

Right now, I tried the newly pushed doc2vec #356 and I was using the training data to test the model itself to check how good it was. I don't seem to be getting good results when using infer_vector().

If I have 100,000 paragraphs of text and I run Doc2Vec's train on it, I can see some vectors are created internally and can access them with docmodel.docvecs[tag]. Now, when I try to run infer_vector() on one of the TaggedDocuments that I learnt with, what values of alpha, min_alpha and steps do I need to give to return back the same vector as docmodel.docvecs[tag]? Is it possible to get back the same vector?

When I use the same alpha and min_alpha, and use the value of iter as steps, I get a completely different vector, and their cosine distance is around 0.2.

I get good results when I find similar terms using docmodel.docvecs[tag] and pretty bad results with infer_vector().
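For what it's worth, a minimal sketch of the comparison described above (assuming docmodel is the trained Doc2Vec, tagged_doc is one of the TaggedDocuments it was trained on, and the older alpha/min_alpha/iter/steps attribute and parameter names apply):

```python
import numpy as np

# Compare the bulk-trained vector for a tag against a freshly inferred vector
# for the same document's words, reusing the model's own training parameters.
bulk_vec = docmodel.docvecs[tagged_doc.tags[0]]
inferred_vec = docmodel.infer_vector(tagged_doc.words,
                                     alpha=docmodel.alpha,
                                     min_alpha=docmodel.min_alpha,
                                     steps=docmodel.iter)

cos_sim = np.dot(bulk_vec, inferred_vec) / (
    np.linalg.norm(bulk_vec) * np.linalg.norm(inferred_vec))
print("cosine distance:", 1.0 - cos_sim)  # smaller is closer; exact values vary by run
```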